SMTPOST Using Statistical Machine Translation Approach in Filipino Part-of-Speech Tagging

نویسندگان

  • Nicco Nocon
  • Allan Borra
چکیده

The field of Natural Language Processing (NLP) in the country has been continually developing. However, the transition between Tagalog to the progressing Filipino language left tools and resources behind. This paper introduces a Statistical Machine Translation Part-of-Speech (POS) Tagger for Filipino (SMTPOST), with the purpose of reviving, updating and widening the scope of technologies in the POS` tagging domain, catering to the changes made by the Filipino language. Resources built are comprised mainly of a tagset (218 tags), parallel corpus (2,668 sentences), affix rules (59 rules) and word-tag dictionary (309 entries). SMTPOST was tested to different tagsets and domains, producing 84.75% as its highest accuracy score, at least 3.75% increase from the available Tagalog POS taggers. Despite SMTPOST’s utilization of Filipino resources and good performance, there are room for improvements and opportunities. Recommendations include a better feature extractor (preferably a morphological analyzer), an increase in scope for all of the resources, implementation of preand/or postprocessing, and the utilization of SMTPOST research to other NLP applications.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

سیستم برچسب گذاری اجزای واژگانی کلام در زبان فارسی

Abstract: Part-Of-Speech (POS) tagging is essential work for many models and methods in other areas in natural language processing such as machine translation, spell checker, text-to-speech, automatic speech recognition, etc. So far, high accurate POS taggers have been created in many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...

متن کامل

Clustered Word Classes for Preordering in Statistical Machine Translation

Clustered word classes have been used in connection with statistical machine translation, for instance for improving word alignments. In this work we investigate if clustered word classes can be used in a preordering strategy, where the source language is reordered prior to training and translation. Part-of-speech tagging has previously been successfully used for learning reordering rules that ...

متن کامل

Induction of Fine-Grained Part-of-Speech Taggers via Classifier Combination and Crosslingual Projection

This paper presents an original approach to part-of-speech tagging of fine-grained features (such as case, aspect, and adjective person/number) in languages such as English where these properties are generally not morphologically marked. The goals of such rich lexical tagging in English are to provide additional features for word alignment models in bilingual corpora (for statistical machine tr...

متن کامل

Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction

We define, implement and evaluate a novel model for statistical machine translation, which is based on shallow syntactic analysis (part-of-speech tagging and phrase chunking) in both the source and target languages. It is able to model long-distance constituent motion and other syntactic phenomena without requiring a full parse in either language. We also examine aspects of lexical transfer, su...

متن کامل

A Survey on Statistical Approaches to Natural Language Processing

This survey attempts to catch up with the recent increasing interests in statistical approach to natural language processing based on large corpora. First of all, a historical overview traces back to 1950s when Noam Chomsky proposed his phrase structure transformation grammar and rejected the Markov process natural language modeling. With the development of large corpora and language modeling i...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016